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Abstract. Using a time series model to mimic an observed time series has a long history. 
However, with regard to this objective, conventional estimation methods for discrete-time 
dynamical models are frequently found to be wanting. In fact, they are characteristically 
misguided in at least two respects: (i) assuming that there is a true model; (ii) evaluating 
the efficacy of the estimation as if the postulated model is true. There are numerous ex- 
amples of models, when fitted by conventional methods, that fail to capture some of the 
most basic global features of the data, such as cycles with good matching periods, singular- 
ities of spectral density functions (especially at the origin) and others. We argue that the 
shortcomings need not always be due to the model formulation but the inadequacy of the 
conventional fitting methods. After all, all models are wrong, but some are useful if they 
are fitted properly. The practical issue becomes one of how to best fit the model to data. 

Thus, in the absence of a true model, we prefer an alternative approach to conventional 
model fitting that typically involves one-step-ahead prediction errors. Our primary aim is 
to match the joint probability distribution of the observable time series, including long- 
term features of the dynamics that underpin the data, such as cycles, long memory and 
others, rather than short-term prediction. For want of a better name, we call this specific 
aim feature matching. 

The challenges of model misspecification, measurement errors and the scarcity of data 
are forever present in real time series modeling. In this paper, by synthesizing earlier at- 
tempts into an extended-likelihood, we develop a systematic approach to empirical time 
series analysis to address these challenges and to aim at achieving better feature match- 
ing. Rigorous proofs are included but relegated to the Appendix. Numerical results, based 
on both simulations and real data, suggest that the proposed catch-all approach has sev- 
eral advantages over the conventional methods, especially when the time series is short or 
with strong cyclical fluctuations. We conclude with listing directions that require further 
development. 
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1. INTRODUCTION 

Dynamical models, either in continuous time or 
in discrete time, have been widely used to describe 
the changing world. Interestingly, salient features 
of many seemingly complex observations can some- 
times be captured by simple dynamical models, as 
demonstrated most eloquently by Sir Isaac Newton 
in the seventeenth century when he used his model, 
Newton's law of universal gravitation, to explain Ke- 
pler's observations concerning planetary motion. In 
statistics, dynamical models are the raison d'etre of 
time series analysis. For a time series, the dynamics 
transmits information about its future from observa- 
tions made in the past and the present. Of particu- 
lar interest are the long-term future, the periodicity 
and so on. To capture salient features, there are es- 
sentially two approaches: substantive and black-box. 
Examples of both approaches abound. The former 
is often preferred if available in the context in which 
we find ourselves. If not available, then a black-box 
approach might be the only choice. We shall include 
examples of both approaches. Let us first mention 
two substantive examples as they are relevant to our 
later discussion. 

1.1 Two Substantive Models and Related 
Features 

1. Animal populations. There are numerous eco- 
logical models describing the time evolution of ani- 
mal populations. The single-species model of Oster 
and Ipaktchi (1978) can be written as 



(1.1) 
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where xt is the number of adults at time t; r is the 
delayed regulation duration due to the time taken 
for the young to develop into adults or discrete breed- 
ing seasons; &(•) is the birth rate; and \x is the death 
rate. There are different specifications for &(•). Gur- 
ney, Blythe and Nisbet (1980) suggested b{u) = 
cexp(— u/Nq), where A^o is the reciprocal of the ex- 
ponential decay rate and c is a parameter related 
to the reproductive rate of adults. Ellner, Seifu and 
Smith (2002) investigated the estimation of model 
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(1.1) by replacing b(xt- T )xt-r and fixt with un- 
known functions B(x t ~ T ) and D(xt), respectively, 
which they then used a nonparametric method to es- 
timate. Wood (2001) considered a similar approach. 
There are several discrete-time versions of (1.1) in 
biology. See, for example, Varley, Gr ad well and Has- 
sell (1973). If we approximate dxt/dt by xt — x t -i, 
then we obtain a nonlinear time series model in dis- 
crete time 



1.21 



x t = b(x t - T )xt- T + vxt-i, 



where v = 1 — /i. 

In ecology, population cycles are often observed 
and are an issue of paramount importance. For ex- 
ample, the blowfly data show a cycle of 39 days and 
the Canadian lynx shows a cycle of about 9.7 years. 
Some ecologists have even suggested chaotic pat- 
terns, although we are skeptical about this possibil- 
ity. Most ecologists consider the dynamics underly- 
ing population cycles as one of the major challenges 
in their discipline. 

2. Transmission of infectious diseases. The conven- 
tional compartmental SIR model partitions a commu- 
nity with population N into three compartments St 
(for susceptible), It (for infectious) and Rt (for re- 
covered) : N = St + It + Rt at any time instant t . The 
SIR model is simple but very useful in investigating 
many infectious diseases including measles, mumps, 
rubella and SARS. Each member of the population 
typically progresses from susceptible to infectious to 
recovered or death. 

Infectious diseases tend to occur in cycles of out- 
breaks due to the variation in the number of sus- 
ceptible individuals over time. During an epidemic, 
the number of susceptible individuals falls rapidly 
as more of them are infected and thus enter the in- 
fectious and recovered compartments. The disease 
cannot break out again until the number of suscep- 
tible has built back up as a result of babies being 
born into the susceptible compartment. 

Consider a population characterized by a death 
rate \i and a birth rate equal to the death rate, in 
which an infectious disease is spreading. The differ- 
ential equations of the SIR model are 

§=4* _(„+„),„ 



dRt 
dt 



—7T = v h ~ fJ'Rt 
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where /3 is the contact rate and v is the recovery 
rate of the disease. See, for example, Anderson and 
May (1991) and Isham and Medley (2008) for de- 
tails. This model has been extensively investigated 
and very successfully used in the control of infec- 
tious diseases. Discrete-time versions of the model 
have been proposed. An example is 

I t+ i = r S t I t /N, St+i = S t - It+i + nN, 

where \x is the birth rate and ro is the basic re- 
productive rate of transmission. See, for example, 
Bartlett (1957, 1960), Anderson and May (1991) and 
the discussion in Section 6. 

Again, an important feature for the transmission 
of infectious disease is the periodicity, to understand 
which it is essential to understand the effect of such 
factors as the birth rate, the seasonal force, the trans- 
mission rate and the incubation time on the dy- 
namics, the phase difference that is related to the 
transmission in different areas, and the interaction 
between different diseases; see, for example, Earn 
et al. (2000) and Rohani et al. (2003). The model 
can also be used to guide the policy maker in con- 
trolling the spread of the disease. See, for example, 
Bartlett (1957), Hethcote (1976), Keeling and Gren- 
fell (1997) and Dye and Gay (2003). 

1.2 The Objectives 

Our primary concern is parametric time series mo- 
deling with the objective of achieving good matching 
of the joint probabilistic distribution of the observ- 
able time series, including, in particular, salient fea- 
tures, such as cycles and others. Short-term predic- 
tion is secondary in this paper. Accepting G. E. P. 
Box's (1976) dictum: All models are wrong, but some 
are useful, we use parametric time series models only 
as means to an end. We are typically less interested 
in the consistency of estimators of unknown parame- 
ters in the conventional sense, which is predicated on 
the assumed truth of the postulated model. In fact, 
we are more interested in improving the matching 
capability of the postulated model. 

Suppose we postulate the following model: 

(1.3) x t = ge(xt-i,...,x t -p) + e t , 

where et is the innovation and the function ge(-) is 
known up to parameters 9. To indicate the depen- 
dence of xt on 9, we also write it as xt(9). Following 
Tong (1990), we call (1.3) with Var(e t ) = the skele- 
ton of the model. In postulating the above model, we 



recognize that it is generally just an approximation 
of the true underlying dynamics no matter how the 
function ge(-) is specified. Of particular note is the 
fact that conventional methods of estimation of 9 in 
the present setup are usually not different from those 
used for a cross-sectional model: with observations 
{yi , 2/2 j • • • , U T } and postulated model gg , typically 
a loss function is based on the errors and takes the 
following form: 

T 

L(9) = (T- P r 1 {yt-g 9 (yt-i,...,yt- P )} 2 , 
t= P +i 

where, here and elsewhere, T denotes the sample 
size. The errors above happen to coincide with the 
one-step-ahead prediction errors. Under general con- 
ditions, minimizing this loss function is known math- 
ematically to lead to efficient estimation if the postu- 
lated model is true. However, the postulated model 
is, by the Box dictum, almost invariably wrong, in 
which case the above loss function is not necessar- 
ily fit for purpose. To illustrate, let observations 
{2/1)2/2, • • • , yr} be given and, of the postulated mo- 
del (1.3), let the function gg be linear and et be 
Gaussian with zero mean and finite variance. Let 
T = {C(j),j = 0, 1, 2, . . . , T — 1} denote a set of sam- 
ple autocovariances of the y-data. Then minimiz- 
ing L(9) yields well-known estimates of 9 that are 
functions of S = {(7(0), C(l), . . . , C(p)}. If the pos- 
tulated model is "right," then S is a minimal set of 
sufficient statistics (ignoring boundary effects) and 
all is well. However, if it is wrong, then it is unlikely 
that S will remain so. Since the model is typically 
wrong, then restricting to S is unfit for the purpose 
of estimating 9; T may be preferable. 

To reconcile with the Box spirit, diagnostic checks, 
goodness-of-fit tests and other post-modeling devices 
are recommended. Indeed Box and Jenkins (1970) 
have stressed these post-modeling devices. See also 
Tsay (1992) for some later developments. These are 
undoubtedly very important developments. However, 
the challenge remains as to whether we can adopt 
the Box spirit more seriously right at the modeling 
stage rather than at the post-modeling stage. 

It is worth recalling the fact that the classic au- 
toregressive (AR) model of Yule (1927) and the mov- 
ing average (MA) model of Slutsky (1927) were orig- 
inally proposed to capture the sunspot cycle and the 
business cycle, respectively, rather than for the pur- 
pose of short-term prediction. 
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2. THE MATCHING APPROACH 

We shall use the letters y and x to signify respec- 
tively the real time series under study and the time 
series generated by the postulated model. The ad- 
jective observable is reserved for a stochastic pro- 
cess. An observed time series consisting of observa- 
tions constitutes (possibly part of) a realization of 
a stochastic process. In order for model (1.3) to be 
able to approximate an observable {yt : t = 1, 2, . . .} 
well, it is natural to require throughout this paper 
that the state space of {xt(6):t = 1,2,...} covers 
that of the observable {yt : t = 1, 2, . . .}. For exposi- 
tional simplicity, let p = 1. Starting from xq(6) = yo, 
the postulated model is said to match an observable 
time series under study perfectly if their conditional 
distributions are the same, namely, 

P{x 1 (9 ) < ui, . ..,x n (8 ) < u n \x (9 ) = y } 

(2.1) 

= P{yi <ux,...,y n <u n \y } 

almost surely for some Oq and any n and any real 
values u\ , . . . , u n . We call the approach based on the 
above model, including all its weaker versions, some 
of which will be described in the next two subsec- 
tions, collectively by the name catch-all approach. 

However, formulation (2.1) is usually quite difficult 
to implement in practice. In the next two subsec- 
tions, we suggest two weaker forms, although other 
forms are obviously possible. 

In the econometric literature, the notion of calibra- 
tion has been introduced (e.g., Kydland and Prescott, 
1996). It has many alternative definitions. Broadly 
speaking, calibration consists of a series of steps in- 
tended to provide quantitative answers to a particu- 
lar economic question. A crucial step involves some 
so-called "computational experiments" with a sub- 
stantive model of relevance to economic theory; it 
is acknowledged that the model is unlikely to be 
the true model for the observed economic data. At 
the philosophical level, calibration and our feature 
matching share almost the same aim. However, there 
are some fundamental differences in methodology. 
Our methodology provides a statistical and coher- 
ent framework (in a non-Bayesian sense) to esti- 
mate all the parameters of a postulated (and usually 
wrong) model. As far as we know, calibration seems 
to be in need of such a framework. See, for example, 
Canova (2007), esp. page 239. The hope is that our 
methodology will be useful to substantive modelers 
in all fields, including ecology, economics, epidemiol- 
ogy and others. At the other end of the scale, it has 
been suggested that our methodology has potential 
in data mining (K.S. Chan, private communication). 



2.1 Matching Up-to-m-Step- Ahead Point 
Predictions 

If we are interested in the mean conditional on 
some initial observation, say yo, we can weaken the 
matching requirement (2.1) to 

E[(x l (9 ),...,x m (6 ))\x (9o) = y } 

= E[{yi,...,y m )\y ], 

where the length m of the random vector is, in prac- 
tice, bounded above by the sample size under con- 
sideration. The expectation is taken with respect to 
the relevant joint distribution of the random vec- 
tor conditional on the initial value being yo- Since 
a postulated model is just an approximation of the 
underlying dynamics, we set 9q to minimize the dif- 
ference of the prediction vectors, that is, 

E{\\E[(x 1 (e),...,x m (e))\x (e) = y } 

(2.2) 

-E[(y u ...,y m )\y }\\ 2 }. 

Here, || • || denotes the Euclidean norm of a vector. 
In other words, we choose 9 by minimizing up-to- 
m-step-ahead prediction errors. It is basically based 
on a catch-all idea. It is easy to see that the best 6 
based on minimizing (2.2) depends on m. Generally 
speaking, we set m = 1, when and only when we 
have complete faith in the model, which is what the 
conventional methods do. Denote the m-step-ahead 
prediction of yt+ m based on model (1.3) by 

dg^ (Vt) = E(x t+m \x t = y t ). 
If model (1.3) is deterministic [i.e., Var(e^) = 0] or 
linear, g^iyt) is simply a composite function, 

dt^ivt) = 9e{ge{- ■ ■ 9e{yt) ■■■))■ 

V v ' 

m folds 

Let 

Q(yt,xM) 

oo 

= SUp ^ Wrn[E{y t+ m ~ 9^ (j/t)} 2 ], 
W ™ m=l 

where w m > and Yl w m = 1 • Since 

H&t+m - B(y t+m \y t )}{E(y t+m \y t ) - g [ ™\y t )}\ = 0, 

we have 

^{yt+m - (vt)} 2 = E{yt+ m - ^(yt+ m \yt)} 2 

+ B{B(y t+m \y t )-g^\y t )} 2 . 
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Let 

oo 

= su p Yl w m[V{E{yt+m\yt) - fljrkift)} 2 ]- 

Wm m=l 

If the observable yt indeed follows the model of xt, 
then miiig Q(yt, xt(0)) = 0. Otherwise we generally 
expect mm e Q(y t ,x t (9)) > 0. Minimizing Q(yt,x t (9)) 
is for xt(0) to arrive at a choice within the postulated 
model that gives all (suitably weighted) multiple- 
step-ahead predictions of yt as accurately as possible 
in the mean squared sense. 

Note that the above measure of the difference be- 
tween two time series is based on a (weighted) least 
squares loss function. Clearly there exist many other 
possible measures. For example, if the distribution of 
the innovation is known, a likelihood type measure 
of the difference can be used instead. A Bayesian 
may perhaps then endow {w m } with some prior dis- 
tribution. This line of development may be worth 
further exploration as suggested by an anonymous re- 
feree. Intuitively speaking, a J-shaped {w m } tends to 
emphasize low-pass filtering, because E(yt+ m \yt) is 
a slowly varying function for sufficiently large m. Si- 
milarly, an inverted- J-shaped {w m } tends to empha- 
size high-pass filtering. An optimal choice of {w m } 
strikes a good balance between high-pass filtering 
and low-pass filtering. 

The most commonly used estimation method in 
time series modeling is probably that based on min- 
imizing the sum of squares of the errors of one-step- 
ahead prediction. This has been extended to the sum 
of squares of errors of other single-step-ahead pre- 
diction. See, for example, Cox (1961), Tiao and Xu 
(1993), Bhansali and Kokoszka (2002) and Chen, 
Yang and Hafner (2004). Clearly, the former method 
is predicated on the model being true. The latter 
extension recognizes that this is an unrealistic as- 
sumption for multi-step-ahead prediction. Instead, 
a panel of models is constructed so that a different 
model is used for the prediction at each different 
horizon. The focus of the extension is prediction. 

The approach that we develop here essentially 
builds on the above extension. First, we shift the fo- 
cus away from prediction. Second, we transform the 
prediction based on a panel of models into the fitting 
of a single time series model. We effectively synthe- 
size the panel into a catch-all methodology. Specifi- 
cally, we propose to minimize the sum of squares of 
errors of prediction over all (allowable) steps ahead, 



as given in (2.3). We stress again that our primary 
objective is feature matching rather than prediction. 
Of course, it is conceivable that good feature match- 
ing may sometimes lead to better prediction, espe- 
cially for the medium and long term. Clearly each 
member of the panel can be recovered, at least for- 
mally, from the catch-all setup by setting, in turn, 
the weight, Wj, to unity, leaving the rest to zero. 

2.2 Matching ACFs 

Suppose that the observable {yt} and {x t (9)} are 
both second-order stationary. If we are interested in 
second-order moments, then a weaker form of (2.1) 
is the following difference or distance function: 

oo 

D c (y t ,x t {9))= sup ^ w m{lx(6)( m ) ~ ly( m )} 2 ■ 

{w m } m=0 

Here, the suffixes of y and x(9) are self-explanatory. 
We assume that the spectral density function (SDF) 
of the observable yt exists; it is given by 

1 1 °° 

fv( u ) = 2^T(°) + -^ly{k)cos(kuj). 
k=l 

The SDF of xt(9), which we also assume to exist, 
can be defined similarly. We can also measure the 
difference between two time series by reference to 
the difference between their SDFs, for example, 

which is called the Itakura-Saito distortion measure; 
see also Whittle (1962). Further discussion on mea- 
suring the difference between two SDFs can be found 
in Georgiou (2007). 

Suppose that {xt (9)} and the observable {yt} have 
the same marginal distribution and they each have 
second-order moments. Then we can prove that 

D F {yt,x t {9))<C 2 Q{yt,x t {9)) 

for some positive constants C\ and C2. Moreover, 
if {xt{9)} and the observable {yt} are linear AR 
models, then there are some positive constants C3 
and C4 such that 

Q(yuXt(9))<C 3 D c (yt,x t (9)), 

Q(yt,x t (9))<C 4 D F (yt,xt(9)). 

For further details, see Theorem A in the Appendix. 
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For linear AR models under the above setup, Q(-, •), 
Dq ■(-,•) and Dp(-,-) are equivalent. However, the 
equivalence is not generally true. A counterexample 
can be constructed easily by reference to the classic 
random telegraph signal process. [See, e.g., Parzen 
(1962), page 115]. 

Let us close this section by describing one way 
of implementing the ACF criterion for an ARMA 
model with normal innovation. Suppose yi,...,y T 
are observations from the observable {yt}- Whittle 
(1962) considered a "likelihood function" for ARMA 
models in terms of the SDF. Let 

1 — 

^Jy t exp(-iwt) 

t=i 

be the periodogram of the sample, where i is the 
imaginary unit. Let fe(oj) be the theoretical SDF of 
an ARMA model with parameters 9. Whittle (1962) 
proposed to estimate 9 by 



I(w) 



2ttT 



mm 



T 

£ 

3=1 



+ log(Mw i )) , 



where uj = 2irj/T. From the perspective of feature 
matching, the celebrated Whittle's likelihood is not 
a conventional likelihood but a precursor of the ex- 
tended-likelihood approach. It matches the second- 
order moments, by using a natural sample version 
of Dp{yt,xt{9)) up to a constant. For this reason, 
it is expected that for misspecified models, Whit- 
tle's estimator can lead to better matching of the 
ACFs of the observed time series than the innova- 
tion driven methods [e.g., the least squares estima- 
tion (LSE) or the maximum likelihood estimation 
(MLE)]. We shall give some numerical comparison 
between Whittle's estimator and the others in Sec- 
tions 5 and 6 below. 

3. TIME SERIES WITH MEASUREMENT 
ERRORS 

To illustrate the advantages of the catch-all ap- 
proach, which involves minimal assumptions on the 
observed time series, we give detailed analyses of two 
cases involving measurement errors, one of which is 
related to a linear AR(p) model and the other a non- 
linear skeleton model. They can be considered spe- 
cial cases of model misspecification in that the ob- 
servable y-time series is a measured version of the 
x-time series subject to measurement errors. For the 
linear case, measurement error is an old problem in 
time series analysis that was studied at least as early 
as Walker (1960). Some new lights will be shed. 



3.1 Linear AR(p) Models 

Consider the following AR(p) model: 
(3.1) x t = 9 1 x t -i-\ \-9 p x t -p + e t . 

Stationarity is assumed. By the Yule-Walker equa- 
tions, we have the recursive formula for the ACF, 
{tOOIj °f the x-time series, namely, 



(3.2) 



7 (fc) = 7 (A:-l)0i +7(fc- 2)9 2 + ••• 
+ -y(k-p)0 p , k = l,2, — 



Let m > p and T r 
(9 1 ,...,9 P ) T and 

/ 7(0) 
7(1) 



( 7 (l), 7 (2),..., 7 (m))" 



7("1) 
7 (0) 



7(-P+l)\ 
7(-P + 2) 



\ 7 (m — 1) j(m — 2) ••• 7(771— p) J 
The Yule- Walker equations can be written as 

f — 

Suppose that the observable y-time series is given 
by yt = Xt + r)t, for t = 0, 1, 2, . . . , where {rjt} is inde- 
pendent of {xt} and is a sequence of independent 
and identically distributed random variables each 
with zero mean and finite variance. Clearly, {yt} is 
no longer given by an AR(p) model of the form (3.1). 

Let { 7 (j)} denote the ACF of the observable y- 
time series. Let t m and T m denote the analogously 
defined matrix and vector of ACFs for the observ- 
able y-time series. 

Suppose we are now given the observations {yi,y2, 
. ..,?/t}) an d we wish to fit the wrong model of 
the form (3.1) to them. We may estimate 7 (j) by 

i{j) = i{-j) = T ~ l Y^=i{yt-y){yt+j-y), y being 

the sample mean. Let T m and T m denote the obvi- 
ous sample version of T m and sample version of T m , 
respectively. 

Since any p equations can be used to determine 
the parameters, the Yule- Walker estimators typi- 
cally use the first p equations, that is, 



or 9 = {Y T p Y p ) 



1 p 1 p> 



which is also the minimizer of X^fe=i{7(^) — 7(^ — 

1)01 _ l{k - 2)6*2 j(k -p)9 p } 2 , involving the 

ACF only up to lag p. We can achieve closer match- 
ing of the ACF by incorporating lags beyond p as 
well. For example, we may consider estimating 9 by 
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minimizing 

m 

k=l 

-%k-2)9 2 A /(k-p)9 p } 2 , m>p. 

Denoting the minimizer by #{ m }, we have 

(3.3) 

{rn ~j 1 m 

k=0 J fc=0 

where = (7(/c),7(fc + l), . . . , ^{k+p — 1)) T . Let us 
call the estimator 9{ m y the up-to-lag-m Yule-Wal- 
ker estimator (or AYW(< to)). For the error- free 
case, that is, r] t = with probability 1, it is easy to 
see that 9{ v } is the most efficient amongst all 9{ m y, 

to = p,p + 1, Otherwise, under some regularity 

conditions, we have in distribution 

V^{d {m }-$}^N(o,t m ), 

where $ = (T^ n T Tn )~ 1 r^ n T m and S m is a positive de- 
finite matrix. For Var(e 4 ) > and Var(rft) = a 2 > 0, 
the above asymptotic result holds with ?9 = + 
cj2(r^r m + 2tf2r p + ^/)- 1 (r p + cr*/)0. For further 
details, see Theorem B in the Appendix. 

Clearly the bias o-*(T£r m + 2a 2 T p + o±I)- l {T p + 
ozT)0 in the estimator will be smaller when to, is lar- 
ger. For sufficiently large sample size, the smaller bias 
can lead to higher efficiency in the sense of mean 
squared errors (MSE). Let Tfc = (^(k), j(k + 1), . . . , 
j(k+p- 1)) T . Then 

m 
k=p 

Thus, the bias can be reduced more substantially 
if the ACF decays very slowly and a larger m is 
used. For example, a highly cyclical time series usu- 
ally has slowly decaying ACF, in which case the 
AYW will provide a substantial improvement over 
the Yule- Walker estimators. However, even with the 
ACF slowly decaying, a large to may cause larger 
variability of the estimator. Therefore, a good choice 
of to is also important in practice. We shall return 
to this issue later. 

In fact, Walker (1960) suggested using exactly p 
equations to estimate the coefficients giving 

{2p-i+i -\ -i 2p-i+e 
^2 ^k^k\ ^2 ^k7(k + l). 
k= P +e J k= P +e 



Note the difference between AYW and 9w.i- Walker 
(1960) showed that in the presence of measurement 
error, then £ = p is the optimal choice amongst all 
candidates with £>p, by reference to MSE. How- 
ever, Walker's method seems counterintuitive be- 
cause it relies on the sample ACF at higher lags to 
a greater extent than those at the lower lags. Fur- 
ther discussion on Walker's method can be found 
in Sakai, Soeda and Tokumaru (1979) and Stauden- 
mayer and Buonaccorsi (2005). It is well known that 
an autoregressive model plus independent additive 
white noise results in an ARMA model. Walker's ap- 
proach essentially treats the resulting ARMA model 
as a true model. This approach has attracted atten- 
tion in the engineering literature. See, for example, 
Friedlander and Sharman (1985) and Stoica, Moses 
and Li (1991). The essential difference between this 
approach and the catch-all approach is that the lat- 
ter postulates an autoregressive model to match the 
observations. And we know that it is a wrong model, 
as we consistently do with all postulated models. 
Note that the use of sample ACFs at all possible lags 
has points of contact with the so-called generalized 
method of moments, used extensively in economet- 
rics. See, for example, Hall (2005). 

Next, we consider estimation based on Q (■,■)■ Gi- 
ven a finite sample size, we may stop at, say, the 
m-step-ahead prediction. Let e\ = (1,0, . . . ,0) T and 
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We estimate 9 by 

T m 

0{m} = argmin ^ ^jW k {y t -i+k 

8 t=p+l k=l 

(3.4) 

-ej$ k (y t -i,...,yt- p y} 2 , 

where is a weight function, typically positive def- 
inite. A reasonable choice of Wk is the absolute value 
of the autocorrelation function of the observed time 
series, that is, w k = \r y (k)\. We call 0{ m } in (3.4) 
the up-to-TO-step-ahead prediction estimator [APE 
or APE(< to)]. 

The asymptotic properties of 9{ m y will be dis- 
cussed later. 

3.2 Nonlinear Skeletons 

A deterministic nonlinear dynamic model with mea- 
surement error is commonly used in many applied 
areas, for example, ecology, dynamical systems and 
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others. See, for example, May (1976), Gurney, Blythe 
and Nisbet (1980), Tong (1990), Anderson and May 
(1991), Alligood, Sauer and Yorke (1997), Grenfell, 
Bj0rnstad and Finkenstadt (2002), Chan and Tong 
(2001) and the examples in Section 6. Consider us- 
ing the following nonlinear skeleton: 

(3.5) x t = ge(x t -i,...,x t - p ) 

to match the observable time series {yt}- 

Employing the Q(-,-) criterion, the estimator is 
given by 

T m 

0{ m }=argmin ^ ^ w k {y t -i +k 
6 t=p+l k=l 

(3.6) 

[m] i \i2 

which we again call the up-to-m-step-ahead predic- 
tion estimator [APE or APE(< m)]. Here the weight 
function {w k } is as defined in (2.3). 

For ease of explanation, we consider again yt = 
Xt + r] t and p = 1. Starting from any state xq = xq, 
let x t = gf^ (xo). Suppose the dynamical system has 
a negative Lyapunov exponent 

n-l 

X e (x )= lim n" 1 Vlog(|5e(xi)|) < 0, 

n— Yoo — * 

t=0 

for all states xq . Similarly starting from xt let xt+ m = 

g l e Hx t ). We predict by yt+m = g e (yt) - By the 
definition of the Lyapunov exponent, we have 

m ( x t + Vt) ~ 9™ (xt)\ ~ exp{m\ e (x t )}\r)t\. 

More generally, suppose the system x t = ge (xt-i, 
...,xt- p ) has a finite-dimensional state space and 
admits only limit cycles, but xt is observed as yt = 
xt + f]t, where {r/t} are independent with mean 0. 
Suppose that the function go(vi,..., v p ) has bounded 
derivatives in both 6 in the parameter space O and 
v±, . . . , v p in a neighborhood of the state space. Sup- 
pose that the system z t = go(zt-i, ■ ■ ■ , zt- p ) has only 
negative Lyapunov exponents in a small neighbor- 
hood of {x t } and in 6 £ 0. Let X t = (xt,xt-i, ■ ■ ■ , 
x t -p) and Y t = (y t ,y t -i, ■ ■ ■ ,yt-p}- If the observed 
Yq = Xq + (?7o,f?-i, • • • ,V-p) is taken as the initial 
values of {xt}, then for any n, 

f(ym+i, ■ ■ ■ ,y m +n\Xo) 

(3.7) -f(y m+1 \X = Y ) 

■■■f(y m+n \X = Y )^0 

as m — > oo. Suppose the equation Y^Xt-^dd (Xt-i) — 
Xt} 2 = has a unique solution in 6, where the sum- 



mation is taken over all limiting states. Let 9{ m y = 

a J gmin fl m- 1 ££l 1 E{yt_ 1+A! -gf{Y t ^)} 2 . If the 
noise takes value in a small neighborhood of the ori- 
gin, then 

^{m} — ^ $o asm— >oo. 

Note that |/(yi|X ) - f(yi\X = Y )\ ^ implies 
that 

f(yi,...,y n \X = Y ) 

^f(y 1 \X = Y )f(y 2 \X 1 = Y 1 ) 

■ ■■f{y n \X n -i = Y n -i), 

which challenges the commonly used (conditional) 
MLE. Equation (3.7) indicates that using high step- 
ahead prediction can reduce the effect of noisy data 
(e.g., due to measurement errors), and provide a bet- 
ter approximation of the conditional distribution. 
The second part suggests that using high step-ahead 
prediction errors in a criterion can reduce the bias 
caused by the presence of r\t ■ It also implies that any 
set of past values, for example, (yt-i, ■ ■ ■ ,yt-p) for 
t>p, can offer us an estimator with the first sum- 
mation in (3.6) removed. However, the summation 
over all past values is more efficient statistically. For 
further details, see Theorem C in the Appendix. 

There are other interesting special cases. For ex- 
ample, when the postulated model has a chaotic 
skeleton, the initial values play a crucial role. One 
approach is to treat the initial values as unknown 
parameters. See, for example, Chan and Tong (2001) 
for more details. Another example is when the pos- 
tulated model is nonlinear, and is driven by nonaddi- 
tive white noise with an unknown distribution. Here, 
the exact least squares multi-step-ahead prediction 
is quite difficult to obtain theoretically and time con- 
suming to calculate numerically; see, for example, 
Guo, Bai and An (1999). In this case, the up-to- 
m-step-ahead prediction method is difficult to im- 
plement directly. However, our simulations suggest 
that approximating the multi-step-ahead prediction 
by its skeleton is sometimes helpful in feature match- 
ing, especially when the observed time series is quite 
cyclical (Chan, Tong and Stenseth, 2009). 

4. ISSUES OF THE ESTIMATION METHOD 

We now turn to some theoretical issues and cal- 
culation problems. In conventional statistical theory 
for parameter estimation, by consistency is gener- 
ally meant that the estimated parameter vector con- 
verges to the true parameter vector in some sense 
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as the sample size tends to infinity. The postulated 
model is assumed to be the true model in the above 
conventional approach. 

In the absence of a true model and ipso facto true 
parameter vector, we propose an alternative defini- 
tion of consistency. Specifically, by consistency we 
mean that the estimated parameter vector will, in 
some sense, tend to the optimal parameter vector 
that represents the best achievable feature match- 
ing of the postulated model to the observable time 
series. To be more precise, for some positive inte- 
ger m (which may be infinite), we define the optimal 
parameter by 



i) 



m 



m , w = argmin / 



t+k 



-E{x t+k (9)\X t (6) = Y t }] 2 , 

where X t (6) = (x t (9), . . . , x t -p+i(0)) and {w k } de- 
fines the weight function, typically positive and sum- 
ming to unity. For ease of exposition, we assume that 
the solution to the above minimization is unique. 
Now, we say that an estimator is feature- consistent 
if it converges to # m , w in probability as the sample 
size tends to infinity. It is easy to prove that under 
some regularity conditions, 6^ m y is asymptotically 
normal, that is, 

r- 1/2 (fl {m} -tf m ,w)^iv(o,ft) 

for some positive definite matrix fi. For further de- 
tails, see Theorem D in the Appendix. 

The optimal parameter depends on m and the 
weight function w k . As discussed in Section 3.1, when 
the autocorrelation decays less slowly, we should 
consider using a larger m. Alternatively, we can con- 
sider assigning heavier weights for larger k. Our ex- 
perience suggests that, for a postulated linear time 
series model, w k can be selected as the absolute 
value of the sample ACF function. For a postulated 
nonlinear time series model aiming to match pos- 
sibly high degrees of periodicity, w k can be chosen 
as constant lasting for approximately one, two or 
three periods. Note that by setting w\ = l and all 
other Wj's zero, the estimation is equivalent to the 
LSE, and the MLE in the case of exponential family 
of distributions. 

The above feature suggests that we may regard 9{ m y 
as a maximum extended-likelihood estimator and fun- 
ctions such as Yn= p+ i Y!k=i Wk{yt-i+k -ej$ k (y t -i, 



\T\2 



} or their equivalents as extended-likeli- 
hoods (or XT-likelihoods for short), with Whittle's 



likelihood as a precursor. An XT-likelihood carries 
with it the interpretation as a weighted average of 
likelihoods of a cluster of models around the postu- 
lated model. In this sense, it is related to Akaike's 
notion of the likelihood of a model (Akaike, 1978). 

For the numerical calculation involved in (3.4) and 
(3.6), the gradient and the Hessian matrix of the 
loss function can be obtained recursively for differ- 
ent steps of prediction. Consider (3.6) as an exam- 
ple. Let g^ stand for 0p (y t -i, yt-p) and write 
g e (vi,...,v p ) as g(vi,...,Vp,6 1 ,...,9 q ). Let gf* = 

Vt-i, ■ ■ .,9 l i P+1] = yt- P , dg [ ™ ] /d0 k = and &g™ / 
(d9 k dd e ) = 0, k,£ = 1, . . . ,q if m < 0. Then for m > 1, 



9(9g ,---,9o , t/l, - - - , CqJ 



and 



^2 9i 

i=l 



[m-i] 



09 k 



+ 9 P +k(9 



m -!] „ m -PJ a a \ 



k = l,...,( 



where g k (vi,... , V p , . . . , Vp+q 
v p+q )/dv k ,k = l ) ... } p + q, and 

do [ml 



M h M t 

2 = 1 J=l 2=1 
P 



52 9 p+ k >^ 9 i 



m— 1] [m— p] 



1=1 



+ 9p+k,p+e{9e 



m— 1] [m— p] 



! • • • 1 ilR ) 



n , ■ ■ ■ , V q 



where gk,i(vi, . . . ,v p , . . . ,v p + q ) = d 2 g(vi, . . . , v p , . . . , 
v p+q ) I (dv k dvi) for k,£ = 1, . . . ,p + q. The Newton- 
Raphson method can then be used for the minimiza- 
tion. 



5. SIMULATION STUDY 

There are many different ways to measure the 
goodness of matching the observed by the postu- 
lated model, depending on the features of interest. 
We suggest two here. (1) The ACFs are clearly im- 
portant features in the context of linear time series, 
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and relevant even for nonlinear time series analysis. 
Therefore, a natural measure can be based on the 
differences of the ACFs, for example, 

1/2 

(5.1) 



T=100 



T=200 



T=500 



N 

*£{ry(k)-r x (k)} 2 /N 

k=0 



for some N, sufficiently large or even infinite, whe- 
re r y {k) and r x (k) are the theoretical ACFs (if avail- 
able) or sample ACFs. Clearly, we can use other 
distances to measure the differences of the ACFs. 
(2) For highly cyclical {yt}, we can measure the 
differences between the observed and the attractor 
(i.e., the limiting state) generated by the skeleton of 
postulated model, after allowing for possible phase 
shifts. Thus, we can use the following quasi-sample- 
path measure: 



T 



(5.2) 



mm 

k 



Xt+k\/T, 



t=i 



where T is the sample size as before. 

To check the efficacy of estimation of parameters, 
especially in a simulation study, we can use an obvi- 
ous measure: {(0 — 9) T (9 — 0)/p} 1 ' 2 for any estima- 
tor of = (0i,...,0 p ) T . Obviously, it is a func- 
tion of the number of steps m in APE(< m) or 
AYW(< m). Note m = 1 corresponds to the com- 
monly used estimation method based on the least 
squares, or the maximum likelihood when normal- 
ity is assumed. Note that the MLE is also based on 
the one-step-ahead prediction for dynamical mod- 
els that are driven by Gaussian white noise. In our 
plotting below, results for APE(< 1) and AYW(< 1) 
are not marked separately from those for APE(< m) 
and AYW(< m) with m > 1. 

Example 5.1 (Model misspecification). We pos- 
tulate an AR(p) model to match data generated by 
fractionally integrated noise (1 — B) d yt = et, where 
0.5 > d > —0.5 and B is the back-shift operator and 
{et} are i.i.d. N(0, 1). The process is stationary, but 
has long-memory when 0.5 > d > 0. The closer is d 
to 0.5, the longer is the memory. For the use of 
low-order ARMA models for short-term prediction 
of this type of long-memory model, see, for exam- 
ple, Man (2002). Any AR(p) model with finite p is 
a "wrong" model for the process. In the following 
analysis, the order p is assumed unknown and de- 
termined by AIC. 

The simulation results shown in Figure 1 are based 
on 2,000 replications. We have the following observa- 
tions. (1) With a misspecified model, the APE(< m) 




Fig. 1. Simulation results for Example 5.1 with differ- 
ent sample size T , index d and the number of steps m in 
AYW(< m) or APE(< m). In each panel, the dotted line, the 
solid line and the dashed line correspond to the Whittle esti- 
mator, the APE and the AYW, respectively. 

and the AYW(< m) with m > 1 show better match- 
ing of the ACFs than the APE(< 1) and AYW(< 1). 
When d is closer to 0.5, the AR model is less likely 
to fit the data well, thus necessitating a larger to. 
(2) When the autocorrelation is not strong, which is 
the case with d being close to zero, the AYW with 
large m shows better matching of the ACF than the 
APE; otherwise APE shows better matching. It is 
interesting to note that although APE does not tar- 
get the ACF directly, it can match the ACF well in 
comparison with the AYW. (3) For small sample size 
or when d is not so close to 0.5, the APE(< to) with 
m > 1 show better matching than the Whittle esti- 
mator; otherwise the Whittle estimator shows better 
matching. 

Example 5.2 (State-space model). Consider the 
AR(4) model with observation errors 

X t = PlXt-l + $lXt-1 + /?3^-3 + + E t , 

yt = xt + vt- 

This is also a special case of a state-space model. 
The estimation of the state model is of interest and 
has attracted considerable attention. See, for exam- 
ple, Durbin and Koopman (2001) and Staudenmayer 
and Buonaccorsi (2005). 

To cover as widely as possible all admissible values 
on the parameter space, we choose flitfiiifiz and /J4 
uniformly distributed in the stationary region. In 
the model, {et} is a sequence of independently and 
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Fig. 2. Results for Example 5.2 when the order p — 4 is 
known. In each panel, the dotted line, the solid line and the 
dashed line correspond respectively to the Kalman filter, the 
APE(< m) and the AYW(< m) over different m. 




Fig. 3. Results for Example 5.2 when the order p is selected 
by AIC. In each panel, the solid line is for APE(< m) and the 
dashed line is for AYW(< m). 



identically distributed random variables, each with 
a unit normal distribution, or i.i.d. N(0, 1) for short; 
{rj t } is i.i.d. N(0,a^), such that the signal-noise ra- 
tio <7^/ Var(yt) = sn is fixed. Again, we run the sim- 
ulation 2,000 times. The results are summarized in 
Figures 2 and 3. When p is known, Figure 2 sug- 
gests that APE(< m) and AYW(< m) with m > 1 
can usually produce models that better match the 
dynamics of the hidden state time series {xt} than 
APE(< 1) and AYW(< 1). When p is selected by 
AIC, Figure 3 suggests that APE(< m) and AYW(< 
m) with m > 1 can still lead to better matching than 
APE(< 1) and AYW(< 1). 

To compare with the Kalman filter approach which 
utilizes the maximum likelihood method or other 
methods such as the EM algorithm, we apply the R 
package "dim" kindly provided by Professor Gio- 
vanni Petris. The results are shown by dotted lines 
in Figure 2. When the order is known, the Kalman 
filter shows good performance in estimating the co- 
efficients and in matching the ACF, but it shows 
very unstable performance when the sample size is 
small. Even worse, if the order is selected by the 
AIC, the Kalman filter appears to be incapable of 
producing reasonable matching, so much so that the 
results are outside the range in Figure 3 in the wrong 
direction. 

Example 5.3 (Nonlinear time series model 1: 
smooth model). Consider the simple nonlinear model 

„2 



Xt 

Ut 



hxt-i + b 2 x t _ 

X t + (TlVt 



+ cr s t ; 
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Fig. 4. Results for Example 5.3 with T = 50 and different crn 
and o"i . The first panel is the estimation error of (61, 62) with 
o\ = \; the second panel is the difference of ACFs between the 
matching skeleton and the true ACF with a\ = 1; the third 
panel is the difference of ACFs between the matching skeleton 
and the estimated ACFs based on random realizations with 
<7i = 1. Panels 4~6 are respectively the corresponding results 
of panels 1-3 but with o~q — 0. 

with parameters 61 = 3.2 and 62 = —0.2; both e% 
and rjt are i.i.d. N(0, 1) but Et is truncated to lie 
in [—4,4]. We replicate our simulation 1,000 times 
for each set of variances a\ and a\. The matching 
results are shown in Figure 4. 

By coping well with noisy data due to <J\r], 
APE(m > 1) demonstrates substantial improvement 
on the parameter estimation (in panel 1 of Figure 4) , 
the ACF-matching of the hidden time series xt (pa- 
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t 



Fig. 5. The upper panel is a realization of the hidden skele- 
ton in Example 5.4- The lower panel is an observed time series 
subject to additive measurement error from N(0, 1). 

nel 2 of Figure 4) and the ACF-matching of the ob- 
served time series (in panel 3 of Figure 4). It is not 
surprising that when the model is perfectly speci- 
fied (i.e., o"i = 0), the APE(< 1) can provide better 
performance than APE(< m) with m > 1 in terms 
of the parameter estimation and the ACF-matching; 
see panels 4-5 of Figure 4. However, APE(< m) with 
m > 1 is still useful in matching features of the ob- 
served time series as shown in the last panel. Our 
results suggest that APE(< m) with m > 1 leads to 
less improvement over APE(< 1) when <jq (for the 
dynamic noise) is larger but greater improvement 
when o\ (for the observation noise) is larger. 

Example 5.4 (Nonlinear time series model 2: SE- 
TAR model). Now, we consider a self-exciting thres- 
hold autoregressive model (SETAR model) with ske- 
leton 

= f a + b x t -i, if x t -d<c, 
Xt \ai + b\xt-i, if xt-d > c, 



where parameters an = 3, bo = 1, a\ = —3, b\ = 1 and 
c = 0. A realization is shown in the first panel of Fig- 
ure 5. It reveals a period of 6 when d = 2, and 10 
(not shown) when d = 3. Suppose that we observe 
Ut = xt + r)t, where {rj t } are i.i.d. N{0,1). A typi- 
cal realization is also shown in the second panel of 
Figure 5. 

Using the APE approach to the simulated data, 
we denote the matching skeleton by xt and measure 
the matching error defined in (5.2) with T = 100. 
Based on 100 replications, we summarize the results 
in Table 1. The matching errors have means and 
standard deviations in the parentheses in column 3; 
the average and standard error (in the parentheses) 
of the periods in all the matching models are listed 
in column 4. Our results suggest that the APE(< m) 
with m > 1 performs much better than the APE(< 
1), both in terms of matching the dynamic range 
and the periodicity. 

6. APPLICATION TO REAL DATA SETS 

In this section, we study four real time series, some 
of which are very well known but others less so. They 
are the sea levels data, the annual sunspot numbers, 
Nicholson's blowflies data, and the measles infection 
data in London after the massive vaccination in the 
late 1960s. 

6.1 Sea Levels Data 

Long-term mean sea level change is of consider- 
able interest in the study of global climate change. 
Measurements of the change can provide an impor- 
tant corroboration of predictions by climate mod- 
els of global warming. Starting from 1992, in each 
year 34 equally spaced observations were recorded. 
The data with the linear trend and seasonality re- 
moved are available at http:/ /sealevel. colorado.edu/ 



Table 1 

The simulation results for Example 5.4 

Model Matching Cycle Frequency of 

setting Method error periods correct periods (%) 



T = 50, d = 2, APE(< 1) 2.1352 (1.0334) 5.3806 (0.6301) 31 

period = 6 APE(< 50) 0.8523 (0.6591) 5.8629 (0.5141) 92 

T = 50, d = 3, APE(< 1) 2.5301 (1.6729) 9.4839 (0.5824) 34 

period =10 APE(< 50) 1.3987 (0.8180) 9.9340 (0.1472) 66 

T = 100, d = 2 APE(< 1) 1.5260 (1.0643) 5.5884 (0.6912) 57 

period = 6 APE(< 50) 0.6471 (0.5301) 5.9180 (0.3940) 95 

T = 100, d = 3 APE(< 1) 2.7196 (1.6411) 9.4005 (0.6224) 34 

period =10 APE(< 50) 1.1502 (0.5133) 9.9705 (0.0770) 78 
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Fig. 6. Results for the sea level data. The data with linear 
trend and seasonality removed are shown in the first panel. 
Panels 2~4 are the smoothed sample SDF and those of the 
fitted models by MLE, the Whittle method and APE(< 20), 
respectively. Panel 5 is the relative averaged multi-step- ahead 
prediction errors by taking those of the one-step method as 
one unit. The curves marked by "o," "x," "★" and "o" are 
for APE(< 10), APE(<20), APE(< 30) and APE(< 50), re- 
spectively. 

current/sl_noib_ns_global.txt. The time series is de- 
picted in the first panel of Figure 6. Note that the 
data are subject to measurement errors of 3-4 mm. 

As an experiment with using a much less than 
ideal model to match this data set, let us postu- 
late an AR model. By AIC, the order of the AR 
model is selected as 6. Next, we apply the MLE 
(equivalently the one-step-ahead prediction estima- 
tion method), the Whittle method and the up-to- 
m-step-ahead prediction estimation method to the 
data. The results are shown in Figure 6. The sam- 
ple spectral density function (SDF) is estimated by 
the method of Fan and Zhang (2004). The results 
show clear evidence of long-memory with the sin- 
gularity at the origin, which is well captured by all 



three methods. However, for the peak away from the 
origin, the Whittle estimation and APE(< m) show 
very similar matching capability and both show much 
better match than the MLE. 

To investigate further, we build an AR(6) model 
for every span of observations of length T = 100 
and make predictions from 1 step ahead to 30 steps 
ahead. For the different estimation methods, their 
averaged prediction errors based on all periods are 
displayed in the bottom panels of Figure 6. The 
MLE method shows clear superior performance for 
short-term prediction, while the reverse is true from 
5 steps onward. 

6.2 Annual Sunspot Numbers 

Sunspots, as an index of solar activity, are relative- 
ly cooler and darker areas on the sun's surface resul- 
ting from magnetic storms. Sunspots have a cycle of 
length varying from about 9 to 13 years. Statisticians 
have fitted several models to predict sunspot num- 
bers. They have also noticed that the cycles are asym- 
metric and that the time from the initial minimum of 
a cycle to its next maximum, called the rise time, and 
the time from a cycle maximum to its next minimum, 
called the fall time, are fairly regular. Due to their link 
to other kinds of solar activity, sunspots are helpful 
in predicting space weather and the state of the iono- 
sphere. Thus, sunspots can help predict conditions 
of short-wave radio propagation as well as satellite 
communications. Historical data of the sunspots have 
been recorded in different parts of the world. The 
data we use are the annual sunspot numbers for the 
period 1700-2008 which are obtainable from http:/ / 
www.ngdc.noaa.gov/stp/SOLAR/. Yule (1927) was 
the first statistician to model the sunspot number 
using a model, now known as the autoregressive 
model, with lag 2. Later refinements of stationary 
linear models can be found in, for example, Brock- 
well and Davis (1991) and others; higher-order AR 
models or ARMA models are used. Akaike (1978) 
suggested that the data are better modeled as non- 
stationary over a long period. Tong and Lim (1980) 
noticed nonlinearity in the data dynamics and pro- 
posed the use of a self-exciting threshold autoregres- 
sive model (or a SETAR model for short). In the 
following, we postulate a two-regime SETAR model 
of order 3 with delay parameter equal to 2 for the 
annual sunspot numbers (1700-2008). Specifically, 

x = { a ° + hQXt ~ l + c 0^-2 + d x t _ 3 , if X t -1 < r , 

where xt = log(no. of sunspots+ 1). Note that Cheng 
and Tong (1992) recommended a nonparametric 



14 



Y. XIA AND H. TONG 





2 3 45 7 10 1520 30 4050 



1 2 3 45 7 10 1520 30 4050 

prediction horizon 




1 2 3 45 7 10 1520 30 4050 

prediction horizon 



Fig. 7. The dashed curves are the averaged prediction er- 
rors based on APE(< 1), the solid curves are those based on 
APE(< m) with m = 10, 20, 30, 50, respectively. The horizon- 
tal dashed lines are the matching errors for the APE(< 1), the 
solid lines are those for APE(< m) with m = 10, 20, 30, 50, re- 
spectively. 

AR(4) model. We also tried SETAR model of order 4 
with delay parameter equal to 2. The performances 
of both models are very similar. 

We use each fixed span of T observations to fit 
the postulated model and then use it to do a post- 
sample prediction based on the skeleton of the fit- 
ted model. We measure the following: (1) the dif- 
ference of cycle periods between the data and the 
fitted model; (2) the frequency of stable fitted mod- 
els; (3) the out-of-sample prediction errors based on 
the skeletons of models fitted by the APE(< m) for 
different m; (4) the difference between the observed 
time series and the time series generated by the best 
fitting skeleton by reference to (5.2). 

The results are shown in Figure 7 and Table 2. 
We may draw the following conclusions. (1) When 
the observed time series is short (e.g., T = 20,35), 



APE(< m) with m > 1 show better matching than 
APE(< 1) in both one-step-ahead prediction and 
multi-step-ahead prediction; see panels 1 and 2 in 
Figure 7. When the length of the time series is longer 
(e.g., T = 50, 100), APE(< 1) can lead to fitted mod- 
els with better short-term (less than 4 steps ahead) 
prediction than APE(< m) with m > 1, but for pre- 
diction beyond 4 steps ahead, the reverse appears 
to be the case, in line with our understanding of 
the APE method. (2) When the observed time se- 
ries is short, APE(< m) with m > 1 shows its abil- 
ity in avoiding unstable models; see the numbers in 
the square brackets of Table 2. (3) For both short 
time series and long time series, models fitted by 
APE(< m) with m > 1 show better matching of the 
observed time series in terms of their cycles; see Ta- 
ble 2 and the horizontal lines in Figure 7. 

6.3 Nicholson's Blowflies 

The data consist of the total number of blowflies 
(Lucilia cuprina) in a population under controlled 
laboratory conditions. The data represent counts for 
every second day. The developmental delay (from 
egg to adult) is between 14 and 15 days for the 
blowflies under the conditions employed (Gurney, 
Blythe and Nisbet, 1980). Nicholson obtained 361 
bi-daily recordings over a 2-year period (722 days). 
However, due to biological evolution (Stokes et al., 
1988), the whole series cannot be considered to rep- 
resent the same system; a major transition appears 
to have occurred around day 400. Following Tong 
(1990), we consider the first part of the time series 
(to day 400, thus T = 200), for which the population 
has a 19 bi-days cycle; see Figure 8. 

Next, we postulate the single species animal pop- 
ulation discrete model (1.2) with b(xt- T ) = cXt-r ' 
exp(— xt- T /No), and thus 



.exp(-xt- T /N ) + uxt-i, 



Table 2 

The averaged difference (and its standard error) of cycle periods in the data and matching models 
and the number of unstable matching models [in squared brackets] 



m in Length of time series 

APE(< m) 20 35 50 100 



1 2.5448 (3.0084) [42] 1.7115 (1.8162) [2] 1.3355 (1.5718) [0] 1.5934 (1.4051) [0] 

10 1.3454 (1.7082) [13] 0.9576 (0.8499) [0] 0.8459 (0.9584) [0] 0.4487 (0.5427) [0] 

20 1.2972 (1.7143) [10] 0.8975 (1.1257) [0] 0.7580 (0.6074) [0] 0.4134 (0.9715) [0] 

30 0.8802 (1.1415) [1] 0.8449 (0.5807) [0] 0.3640 (0.5894) [0] 

50 0.8548 (0.5813) [0] 0.3538 (0.4267) [0] 
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where we take r = 8 (bi-days) corresponding to the 
time taken for an egg to develop into an adult. Note 
that we specify b(xt- T ) slightly differently from Gur- 
ney, Blythe and Nisbet (1980) by adding an expo- 
nent a — 1 to xt- T , which is usually necessary when 
a differential equation model is discretized and ap- 
proximated by a time series model; see Glass, Xia 
and Grenfell (2003). In the model, there are four pa- 
rameters: c,a,No and v. The (one-step-ahead pre- 
diction) MLE estimates for the parameters are 

c = 20.1192, iV = 589.5553, 

v = 0.7598, q = 0.8461. 

The APE method gives 

c = 591.5801, N = 1307.0, 

v = 0.6469, a = 0.2633. 

The skeletons based on the postulated model with 
parameters estimated by above methods are shown 
in panels 1 and 2 in Figure 8, respectively. They 
show that APE(< T) results in a model whose skele- 
ton matches the observed cycles to a much greater 
extent than APE(< 1). APE(< 1) gives a period 
of 21 bi-days; APE(< T) gives a period of 19 bi- 
days, which is almost exactly the average period of 
the observed cycles. We have also postulated a SE- 
TAR model. With APE(< T), the SETAR model 
can also capture the observed period very well, but 
again this is not the case with APE(< 1). To inves- 
tigate how the cycles change with the time needed 
by the fly to grow to maturity, we vary the time r 
from 4 to 100 bi-days. The corresponding cycles 
(in bi-days) are shown in the last two panels of 
Figure 8. APE(< T) shows a clear linearly increas- 
ing trend in the cycle-periods as r increases, while 
APE(< 1) shows strange excursions that are diffi- 
cult to interpret. The linear relationship suggested 
by APE(< T) may be helpful in throwing some light 
on the important but not completely resolved cy- 
cle problem of animal populations. We have also 
tried APE(< m) with m equal to twice or thrice 
the cycle-period. Their results are similar to those 
of APE(<T). 

6.4 Measles Dynamics in London 

It is well known that the continuous-time suscepti- 
ble-infected-recovered (SIR) model using a set of 
ordinary differential equations can describe quali- 
tatively the behavior of epidemics quite well. How- 
ever, it is difficult to use it for real data modeling 
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Fig. 8. Results for the Nicholson's blowflies data. In the 
first two panels, the dashed lines are for the observed popu- 
lation; the solid lines are for realizations from models fitted 
by APE(< 1) and APE(< T), respectively. The dashed lines 
in panels 3 and 4 are the periodograms of the observed data, 
and the solid lines are those of the models fitted by APE(< 1) 
and APE(< T), respectively. In panels 5 and 6, for each t 
marked in the x-axis, the vertical column is the periodogram 
with the values color-coded, brighter color (blue being dull) 
corresponding to higher power value. Thus the brightest point 
indicates the cycle-period of the dynamics at t. 



when the observations are made in discrete time. 
To bridge the gap between the theoretical model 
and real data fitting, several discrete-time or chain 
models have been introduced. The Nicholson-Bailey 
host-parasite model (Nicholson and Bailey, 1935) is 
an early example. Bailey (1957), Bartlett (1960) and 
Finkenstadt and Grenfell (2000) proposed different 
types of discrete-time epidemic models. A general 
discrete-time or chain model can be written as fol- 
lows: 



(6.1) 



St+i — St + B t 
I t+l = S t P(I t ), 



t+i, 



where It, St and Bt are respectively the number of 
the infectious, the number of the susceptible and the 
number of births, all at the ith time unit. There are 
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many possible functional forms for the (probabil- 
ity) P(I t ). Examples are 1 - (1 -ro/iV) 7 ' (Bartlett, 
1960), 1 - exp(-r I t /N) (Bartlett, 1956), r I t /N 
(Baily, 1957) and Rolf /N (Liu, Hethcote and Levin, 
1987; Finkenstadt and Grenfell, 2000), where N is 
the effective population of hosts, and tq is the basic 
reproductive rate. 

Next, we postulate the following (deterministic) di- 
screte-time SIR model for the transmission of measles: 

I t+ i = ex.p(6 t ,kPk)StIt, 

t t+i 

St+i = St + bt — It+i = So + ^2 Bt — It, 

t=0 r=l 

where exp(8t t kPk) is employed to indicate the sea- 
sonality force, with 8t & = 1 if time t is at the kth 
season, otherwise. For measles, the time unit for t 
is bi-weekly, based on the infection procedure of 
measle; see Finkenstadt and Grenfell (2000). Now, 
k = 1, . . . , 26 bi- weeks corresponds to about 54 weeks 
in a year. Finkenstadt and Grenfell (2000) consid- 
ered the same model but with the first equation be- 
ing I t +\ = exp(8t,kPk)StIt ■ Here, we take a = 1 for 
two reasons. (1) If a < 1, Finkenstadt and Grenfell 
(2000) were unable to use the model to explain the 
dynamics of measles in the massive vaccination era. 
(2) Experience with statistical modeling of ecolog- 
ical populations suggests that a can be taken as 1 
with improved interpretation; see Bj0nstad, Finken- 
stadt and Grenfell (2002). In practice, It may not 
be observed directly; what can be observed is a ran- 
dom variable, say yt, that has mean It. For this ob- 
servable yt, we postulate a model x t that follows 
a Poisson distribution with mean It- 

There are some problems with the data. There 
is nonnegligible observation error in the data due 
to the under-reporting rate, which can be as high 
as 50%; see Finkenstadt and Grenfell (2000), where 
a method was proposed to recover the data. Fol- 
lowing their method, the data were adjusted for the 
under-reporting rate. The adjusted data are shown 
in dashed lines in panels 1 and 2 of Figure 9. It is 
known that the role of vaccination is equivalent to 
the reduction of the birth rate (Earn et al., 2000). 
Thus, we adjust the number of births by multiplying 
it by the un-vaccination rate, that is, 1— (vaccination 
rate). We show the adjusted births in the third panel 
of Figure 9. Another problem with the data is that 
the susceptible St is unknown, which can also be 
reconstructed by the method of Finkenstadt and 
Grenfell (2000). 
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Fig. 9. Results for modeling the measles incidents in Lon- 
don. The dashed lines in panels 1 and 2 are the recovered 
incidents of measles; the solid lines are the realizations of the 
model based on APE(< 1) and APE(< T), respectively. Panel 
3 is the adjusted birth rate by removing the vaccinated; in the 
bottom panels, the dashed lines are the periodograms of the 
data and the red lines are those of the matching skeleton by 
APE(< 1) and APE(<T), respectively. 

The estimates of the model by APE(< 1) are listed 
in Table 3. To ease the calculation of APE(< T), we 
simplify the model by taking (3k = P + \(Pk,i ~ P), 
where fii i, ■ ■ ■ ,/?26,i are the estimates of APE(< 1) 
and P is their average. Consequently, only A and So 
need to be estimated in implementing APE(< T). 
The skeletons based on models fitted by APE(< 1) 
and APE(< T) are shown in solid red lines in panel 1 
and panel 2 of Figure 9, respectively. APE(< T) 
shows a much better match than APE(< 1) in terms 
of outbreak scale and cycle period. The periodogram 
is also much better matched by APE(< T) than by 
APE(< 1); see the last two panels of Figure 9. We 
have also tried APE(< m) with m being twice or 
thrice the cycle period (i.e., 26 bi- weeks). The re- 
sults are similar to APE(< T). 

An important feature in the measles transmission 
is that there were some big annual outbreaks in the 
1950s when the birth rate was very high after the 
second world war, and some big bi-annual outbreaks 
in the middle of the 1960s when the birth rate was 
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Table 3 

Parameters in the measles transmission model 



Method 


0i 


02 


03 


04 


/3s 


06 


0r 


08 


APE(< 1) 


-11.92 


-12.00 


-11.88 


-11.99 


-11.89 


-11.81 


-11.89 


-11.97 


APE(< T) 


-11.95 


-12.00 


-11.93 


-11.99 


-11.93 


-11.89 


-11.93 


-11.98 




/3 9 


0io 


0u 


012 


013 


014 


015 


016 


APE(< 1) 


-11.92 


-11.99 


-12.05 


-12.01 


-11.93 


-11.96 


-11.98 


-12.04 


APE(< r) 


-11.95 


-11.99 


-12.03 


-12.00 


-11.96 


-11.98 


-11.99 


-12.02 




017 


018 


019 


020 


021 


022 


023 


024 


APE(< 1) 


-11.95 


-12.15 


-12.28 


-12.40 


-12.21 


-11.99 


-11.79 


-11.87 


APE(< T) 


-11.97 


-12.08 


-12.16 


-12.23 


-12.12 


-11.99 


-11.87 


-11.92 




/3 25 


026 


So 












APE(< 1) 


-11.99 


-11.98 


17,8280 












APE(< T) 


-11.99 


-11.98 


16,8190 













relatively low. The dynamics before the massive vac- 
cination in the late 1960s was modeled very well 
by a time series model in Finkenstadt and Gren- 
fell (2000). The theory that relates population cy- 
cle length to birth rate has been well accepted in 
epidemiology and ecology. In epidemiology, the re- 
lationship will either prolong or shorten the cumu- 
lation procedure of susceptibles for a big outbreak. 
Observations from the other sources have lent sup- 
port to this theory. For example, the measles in 
New York have a three-year or four-year cycle when 
the birth rate is very low. As another supporting 
piece of evidence, in the vaccination era, the cycles 
lasted longer, to four or five years because vaccina- 
tion is equivalent to the reduction of birth rate in 
the transmission of disease. However, the dynamics 
after the massive vaccination is difficult to model 
due to the quickly changing birth rate. The method 
of Finkenstadt and Grenfell (2000) has failed to cap- 
ture this change of cycles in the vaccination era. It 
is therefore worth noting that our modified model, 
with the aid of APE(< m) with m > 1, shows sat- 
isfactory matching. To investigate further how the 
cycles change with the birth rate, for each fixed num- 
ber of births we run the estimated model and depict 
its periodogram and highlight the peaks by color- 
coding (brighter color for higher power). The peaks 
with the brightest points correspond to the cycles of 
the postulated model. Figure 10 shows clearly that 
when the birth rate is high (from about 5,000 up- 
ward) the cycle is annual, but when the birth rate is 
medium at about 3,000 to 4,000, the cycles become 
two-year cycles. As the birth rate gets lower, the 
model shows that cycles become three- year cycles or 




number of births 



Fig. 10. Measles transmission. Each vertical column is the 
periodogram with the values color-coded, brighter color cor- 
responding to higher power. (Dark blue is considered a dull 
color.) 

even five-year cycles. It seems that by fitting a sub- 
stantive model with the catch-all approach, we have 
obtained perhaps the first discrete-time model that 
is capable of revealing the complete function link- 
ing birth-rates to the cyclicity of measles epidemics, 
thereby lending support to the general theory de- 
veloped by Earn et al. (2000), which was based on 
differential SIR equations in continuous time. 

7. CONCLUDING REMARKS 
AND FURTHER PROBLEMS 

In this paper, we adhere to Box's dictum and 
abandon, right from the very beginning, the assump- 
tion of either the postulated parametric model being 
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true or the observations being error-free. Instead, 
we focus on ways to improve the feature matching 
of a postulated parametric model to the observable 
time series. We have introduced the notion of an op- 
timal parameter in the absence of a true model and 
defined a new form of consistency. In particular, we 
have synthesized earlier attempts into a systematic 
approach of estimation of the optimal parameter, 
by reference to up-to-m-step-ahead predictions of 
the postulated model. We have also developed some 
general results with proofs. 

Conventional methods of estimation are typically 
based on just the one-step-ahead prediction. Our ana- 
lysis, simulation study and real applications have con- 
vinced us that they are often found wanting in many 
situations, for example, the absence of a true model, 
short data sets, observation errors, highly cyclical 
data and others. Our stated primary objective is fea- 
ture matching. Prediction is secondary here. How- 
ever, we have evidence to suggest that a model with 
good feature matching can stand a better chance of 
enjoying good medium- to long-term prediction. Of 
course, if the aim is prediction with a specified hori- 
zon, say mo, then we simply set w mo = 1 and the 
rest zero. In that case, our catch-all approach really 
offers nothing new. 

Let us now take another look at the difference be- 
tween APE(< m) with m > 1 and APE(< 1). Sup- 
pose we postulate the model x t = ge{X t -\) + £t whe- 
re X t -i = (xt-i, ■ ■ ■ , xt- p ) to match an observable 
y-time series. Given data {yi,y2, ■ ■ ■ ,Ut} , APE(< 
m) with m > 1 and with a constant Wj > 0, 
all j, estimates 9 by minimizing the objective func- 
tion 



T mm(m,T—t) 

L m {9)= Y Y {Vt-i+k 
t=p+l k=l 

= L 1 (8) + L+(9), 
where Y t -\ = (yt-i, Vt- P ) and 

T 

Li{9)= Y {yt-i ] (Y t ^)} 2 , 

t= P +i 

T min(m,T— t) 

L t(0)=Y E Ivt-i+k-gPiYt-!)} 2 . 

t=p+l k=2 



Note that L\(9) is the commonly used objective 
function for APE(< 1), while Lf(9) is the extra in- 
formation provided by the dynamics. In terms of 



samples, L\(9) is based on sample {yt, Yt-\ :t = p + 

1. . . . ,T}. The extra term Lf(9) is associated with 
the extra pseudo designed samples {y t -i+k, Yt-i '■ t = 
p + 1, . . . ,T, k = 1, . . . , m}. If the data are actually 
generated by the postulated model (a rare event), 
then under some general conditions such as et are 

1.1. d. normal, L\(9) will include all the information 
about 9. In that case, estimation based on L\(9) 
alone is the most efficient and the extra term Lf(9) 
can provide no additional information. However, if 
the data are not exactly generated by the postu- 
lated model (a common event), the extra informa- 
tion provided by Lf(9) can indeed be very helpful 
and should be exploited. 

Despite evidence, both theoretical and practical, 
of the utility of the catch-all approach, much more 
remains to be done. Our paper should be seen as 
the first word on feature matching. Although we 
have provided some concrete approaches, such as the 
catch-all-conditional-mean approach, the catch-all- 
ACF approach, which can easily be generalized to 
catch-all-mth-order moments and others, there are 
outstanding issues. For example, we can, at present, 
offer no theoretical guidance on the specification of 
the weights, {w m }. We have only offered some prac- 
tical suggestions based on our experience. It would 
be interesting to investigate further possible connec- 
tions with a prior in Bayesian statistics. 

We have been quite fortunate with our real ex- 
amples using the APE method, thanks to our long- 
standing collaboration with ecologists and epidemi- 
ologists. However, we are conscious of the need for 
the accumulation of further experience. We are con- 
vinced that, especially in the of substantive 
modeling, guidance by relevant subject scientists is 
paramount. Relevant references include He, Ionides 
and King (2010), King et al. (2008), Laneri et al. 
(2010) and others. 

Last but not least, future research should include 
at least the following: other weaker forms of (2.1), 
choice of a suitable weaker form in a specific ap- 
plication, other criteria for model comparison, non- 
additive and/or heteroscedastic measurement errors, 
the relaxation of stationarity, the effect of prefilter- 
ing of data, multiple time series, model selection 
among a set of wrong models (each fitted by the 
catch-all method; perhaps the idea of model cali- 
bration in econometrics might be useful here), pos- 
sible extension to other types of dependent data, for 
example, spatial data. 
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APPENDIX: OUTLINES OF THEORETICAL 
JUSTIFICATION 

We need the following assumptions. However, the- 
se assumptions can be relaxed with more compli- 
cated theoretical derivation. 

(CI) Time series {yt} is a strictly stationary and 
strongly mixing sequence with exponentially 
decreasing mixing-coefficients. 

(C2) The moments E\\y t \\ 2S , E\\gf\y t , . . . ,y t - P )}\\ 2S , 



,y^ P )/d6\\ 5 and E| 
{89d9 T )\\ s exist for some 5 > 2. 



(C3) The functions dg£\y t )/d6 and d 2 gf ] (y t )/ 
(89d9 T ) are continuous in 9 £ and 



n 



def 



k=l 



09 d9 T 



[y 



[k] f ^ d 2 9# ] (yt) 
k-g$ (yt)\ qqqqt 



is nonsingular. 
(C4) The function J2T=i^E[y t+k - gf ] (Y t )] 2 has 
a unique minimum point for 9 in the parame- 
ter space 0. 

Theorem A. Suppose that {x t {9)} and {y t } ha- 
ve the same marginal distribution and each has se- 
cond-order moments. Then 

D (y t ,x t (e))<CxQ(y t ,x t (9)), 
D p (y t ,x t (9))<C 2 Q(y t ,x t (9)) 



C\yt))\- 

(y t )}}. By the 

rw> i/2 - 



By the assumption on the marginal distribution, we 
have 

V{y t gf\y t )} = E{x t g [ ™\x t )} 

= E{x t E(x t+ m\xt)} = E(x t X t+m ). 

Thus 

E{ytyt+ m ) = E(x t x t+m ) 

(A.2) 

+ E[y t {E(y t+ 

m\yt 

It follows from (A.l) and (A.2) that 
j y (m) =7 x (m) + A m , 

where A m = E[y t {E(y t+m \y t ) - 
Holder inequality, we have 

|A m | < {Ey^iEiEiyt+rnlyt) 
Therefore, 

D c (x t (9),y t ) 

oo 

< sup^w^Ey 2 } 1 / 2 

{u>k} k=0 

■ {E{E{y t+k \y t ) - i\ m )} 2 } 1 ' 2 

<CiQ(0), 

where C\ = {Ey 2 } 1 ^ 2 . This is the first inequality of 
Theorem A. 

For ease of exposition, assume that {yt} and {xt(9)} 
are given by AR models with the same order, P. 
Otherwise we take the order as the larger of the 

two orders. So y t = fiiyt-i H h fipyt-p + e t and 

x t = 9ix t -.i-\ V 9px t -p + rj t . 

Let ei = (1,0,...,0) T , = (y t _i, . . . , y t - P ) T 



for some positive constants C\ and C 2 . Moreover, if Xt-i = (xt-i, ■ ■ ■ , xt-p) T , £t = (^t, 0, . . . , 0) T and 
{xt(9)} and {yt} are linear AR models, then there 
are some positive constants C3 and C4 such that 

Q(yt,xt(9))<C 3 D c (y t ,xt(9)), 

Q(yuXt(9))<C 4 D F (y t ,x t (9)). 

Proof. By the condition on the marginal dis- 
tributions, we have 

(A.l) E(y t+m ) = E(x t+m ). 

Since E[y t {y t+m - E(y t+m \yt)}] = 0, we have 

E(y t yt +m ) = E{y t g [ ™\yt)} + E[y t {y t+m - g^(y t )}] 

= E{y t g [ ™\yt)} 

+ E[yt{E{y t+m \yt)-g [ ™\yt)}]. 
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Then Y t - 1+m = ejT^Y t -i + eJ (£ t -i+m + To£t-2+m + 
■■■+T^£ t ). It follows that 

, , hy(ni),i y (m+l),...,iy(m + P-l)] 
(A.3) 



E(y t - 1+m Y t l 1 ) = eJr™ 
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where S = E(Y t -iY^_ 1 ) = (j y (\i - j\))i<i,j<p- Sim- 
ilarly, we have 

[j x (m),j x (m + l),...,"f x (m + P- 1)] 

(A.4) 



E(x t - 



T 



i+ m X t _i 



where £ = E{X t -\X, 



(7*(I<-J|)) 



i<«j'<p- 



Assuming e t , are independent sequences of i.i.d. 
random variables, we have 

E( Ift _ 1+tn |r t _i) = e7rs»y t _i, 
E(x t _ 1+m |x t _! = r t _!) = e7r m y t _!. 

(Note: The i.i.d. assumption can be relaxed at the 
expense of a much lengthier proof.) It follows that 

E{E(y t _i +m |y t _i) - E(^_i +m |X t _! = y_!)} 2 

= e7(r^-r m )s (r^-r m ) T ei 
= e7[r^s -r m s + r m (s-s )] 
• E„ 'Fo^o - r m s + r m (s - s )] T ei 

^ A min( S o)||[7 2 /M-7 a; (w)> 

7^(777,+ 1) - j x (m + l), . . . , 
7„(m + P - 1) - 7x (m + P - 1)] 



+ e7r m (s-s ) 



m+P-1 



k=m 



P-l 



+ A-L(So)AS ax (r)P X){7y(fc) - 7*(*0} 5 



fc=0 

where A m i n (£o) and A max (T) are the minimum eigen- 
value of So and the maximum eigenvalue of T, re- 
spectively. Note that A max (T) < 1. Therefore, 

oo 

Q{9) < PA^(Eo) £ *M7y(*0 " 7x(^)} 2 

m=0 

= C 3 D c (xt(6),y t ), 
for some w m > 0. The proof is completed. □ 

Theorem B. Under assumptions (CI) and (C2), 
we /jawe in distribution 



Proof. To simplify the range of summation in 
the triangular array due to the lags with fixed m 
as T — > oo , we introduce = to denote the fact that 
the quantities on both sides of it have negligible dif- 
ference. By Theorem 3.1 of Romano and Thombs 
(1996), in an enlarged probability space we have 

f m = T m + n- l ' 2 U m + o p (n" 1 / 2 ), 



T 



T m + n- 1 / 2 V m + o p (n- 1 / 2 ), 



where U m and V m have the same structure as T m 
and T m , respectively, but with 7 (fc) being replaced 
by v k and (v i+1 , . . . ,v i+j ) for any i,j being jointly 
normal, with variance-covariance matrix given by 
Romano and Thombs (1996). Therefore, we have 

e m = $ + n- 1 l 2 W + o P {n- l l 2 ), 

'Tr W7/Tx _i_ CrTr \-lrT 



JJ m T m + (T m T m j 



r 1 v — 

1 m y rn 



r m T m is a lin- 



where W = {Y^Y^Ui 

(r m r m ) {r m z^ m +z^ m r m }(r m r. 

ear combination of {ufe}. Thus, W is normally dis- 
tributed with mean 0. This is the first part of The- 
orem B. 

If y t = x t + r)t, let 7 x (fe) = 7?," 1 Yn=i x t%t+k; it is 
easy to see that 

%(k)^%(k)+D k + E k , A: = 0,1,..., 

where D k = n' 1 J2™=i(x t+k + x t - k )r]t and = 
n_1 SILi VtVt+k- By the central limit theorem and 
Theorem 3.1 of Romano and Thombs (1996), in an en- 
larged probability space there are random variables 
£ fc ,Cfc and 5 k such that %(k) = J x {k) + n _1 / 2 £ fc + 
Opin- 1 / 2 ), D k = n~ l / 2 C k + Opin" 1 ' 2 ) and 



E k 



tf + n-^dk + opin- 1 / 2 ), iffc = 0, 
n- 1 / 2 ^ + o p (n- 1 / 2 ), ifjfe>0, 



where £ ,fi, • • • ,{Ck,k = 0,1,.. .},£ ,5i, ... are mu- 
tually independent and ^ k = ^(^{Eef — l} 1//2 W"o + 
Ef=i&x(j + k) + 7x(j ~ k)}W r Here W , W u ... are 
i.i.d. #(0,1), Cfc~^(0,2( 7y (0)+7,,(2A;))),Cov(Cfc ) 
0) = 2( ly (k -£) + j y (k + £)) and S k ~ #(0, oj) if 
fc > and <5 ~ iV(0, E(V - l) 2 ). Define E k , Z k and A fc 
similarly as Y k with r ) x {k) being replaced by 
and S k , respectively. Let B k be a k x p matrix with 
the first p x p submatrix being o~ 2 I p and all the oth- 
ers 0. We have 



V^{#{m.}-tf}^#(0,£ m ), 

where § = (Y^Y^Y^t m and t m is a positive de- Tk = Tk + B k + n~ l/2 £ k + o p {n~ l l 2 

finite matrix. As a special case, if y t = x t + rjt with where £ k = E k + Z k + A k , 
Var(ei)>0 and Var(r7 i )=cr 2 > 0, then the above asym 
ptotic result holds with -0 = 6 + cr 2 (r^r m + 2a 2 Y p + 
aprHY p + a 2 I)9. 



T k = T k + n- l ' 2 ^ k + o v 



n 



n 



-1/2) 
-1/2^ 
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and ty k = (£1, . . . , £fc) T . It follows that tion, we have x t = ge (X t -i) . Write 

6 m = [TlT m + 2a 2 F p + ap E[{gf ] (Y t ^) - x t . 1+k } 2 ] 

+ n- 1 ' 2 {{T m + B m ) T £ m + 4(r m + B m )} = { g W (Xt _ l} _ g M {Xtl) f 

• \rlr m + a 2 v r p E[g[k] ^ + ^ _ ^ ( ^ )} 

+ E[{ 5 ; fcl (x t _ 1 + £: t „ 1 )-^ 1 (^„ 1 )} 2 ]. 

Note that by the definition of the Lyapunov exponent, 



+ rT x ' 2 {{Y m + B m ) T V m + £lT m 0] 

+ o P {n- 1 ' 2 )} 

^Tr , o_2t- , Jn-l/rTr , _2 



(r m rm + 2<7 J? r p + a v I) (T m T m 
+ n- l l 2 W n + o{n- 1 ' 2 ) (A.5) 

e - a^vlXm + 2a 2 r p + ap)-\r p + o- 2 I)0 



<exp(A;A){E|| ( 5 t ||} 1 / 2 . 



Starting from Xq = Yq, the system at the fcth step 
+ n W n + o(n ), j g ^[fc] (y ). Since the Lyapunov exponent is negative, 



where W m = (I^r m + 2a 2 T p + ^/)- 1 {(r m + we have 

B m ) T * m + & m 0] - (TlT m + 2a 2 r p + ap)~ 2 ■ (g^ (y ) , . . ., 5 f +nl (F )) 

{(r m + s m ) T £ m +«?^(r m + J B m )}(r^r m + a 2 r p ) is 

normally distributed. We have proved the second 



(g^ +1 \x ),...,g^ +n \x )) 



P art - D +(6 m+u ...,S, 



1 Vm+n ) i 



Theorem C. Suppose the system {x t = g 6o {x t -i, whe re 5 k = $ ] (Y ) - gf ] (X ), with \S k \ < exp(A;A) • 

...,Xt-p)} has a finite-dimensional state-space and II) 1 / 2 Th f ° 

admits only limit cycles, but x t is observed as yt= 1 " ' ' 

xt + rjt, where {q t } are independent with mean 0. (y m+1 , . . . ,y m+n )\(X = Y ) 
Suppose that the function gg(vi, ... ,v p ) has bounded 

derivatives in both 6 in the parameter space and ~ ^ m+1 ' ' ' ' >y™+n)\ o + \ m +!' • ■ • > m+nj- 

ui, . . . , v p in a neighborhood of the state-space. Sup- Note that ( ym+lj . . . , y m+n )|X = (g l ™ +1] (X ), 

pose i/iai i/ie system z t = g g (z t -i,. . .,z t - p ) has only [ m+n ] , , . ° 

negative Lyapunov exponents in a small neighbor- ' oj j + ( ? ?m+i 5 • • • , ^m+nj an a JJm+l) • ■ • i 

W o/ {x t } and m 6 G 6. Lei X* = (x t ,x t _i, . . . , are inde P endent - Therefore the first part of 

Ineorem C follows. 



x t _p) andY t = (yt,yt-i,---,yt-p)- 
1. // i/ie observed Yq = Xq + (r/o, 77 1, ■ • ■ , Tj-p) ^ s ta- 



By (A.5), we have 



ken as the initial values of {xt}, then for any n, 
f(y m+1 , y m+n \X ) < Cexp(fcA){E||^||} 1 / 2 . 

- f(y m+1 \X =Y )--- f(y m+n \X Q = Y ) -> It follows that 



as m — ¥ 00 . 

2. Suppose the equation ^x t _i{#0 (^t-i) — %t} 2 = 
has a unique solution in 9, where the summa- 



l J2E{x t „ 1+k -gf ] (Y t )Y 



m 

k=l 



tion is taken over all limitinq states. Let 9< m \ = _i < \k\,^ r s \k\,^ r N1 2 

argmm e m i 2^fc=i E i Vt-l+k ~ 9 e { Y t-i)i ■ V the k =i 

noise takes value in a small neighborhood of the m 

origin, then 6 {m} -> asm ^00. < C'{E||^||} 1/2 m~ 1 ^exp(fcA) 

Proof. Let F t _i = (y t -i, . . . ,y t - p ),£ t -i = (vt-i, k=1 

. . . ,r]t-p) and X t -\ = (xt-i, . . ■ ,xt- p ). By the condi- = A(m) — >• as m — > 00. 
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By the same argument leading to (A. 6), we have 



m 



fc=i 

(A.6) <m~ 1 ^E{x t 



m 



i Xt+k- 



xt+kY 



-l+k 



(Yt)Y 



k=l 



$(X t )} 2 + A(m) 



> ^2 {g#( x t+k-l + et+k-l, ■ ■ ■ , 

k=ij+l 



k=l 



xt+k-p + e t +k- 



xt+kY 



By the second inequality of (A.6) and the continuity, 
we have as 6 — > 6q and m — > oo, 



C(e 



t+ij-p 







(A.7) m- 1 J2^'{xt-i+k 

k=l 



i fc] (^l)} 2 



0. 



Next, we show that if \\0 — 9q\\ > 5 > 0, then as m 
oo there exists 5' > such that 

m 

,W/y^ Jfc] 

fc=l 



(A 



m 



■ 1 E{fP I w-fl2 1 TO} a >^>o. 



for some C > 0. Let j — > oo; we have 
Efc=^+i{^(^t+fc-i, ■ • ■ -x t +k} 2 = 0, which 

contradicts the assumption of a unique solution (A. 9). 

By (A.6), (A.7) and (A. 8), we have completed the 
proof of Theorem C. □ 



Theorem D. Recall the notation in Section 3.2 

and let St = (st, 0, . . . , 0) T andMt = (rjt, r/ t _ p+ i) T . 
We prove (A.8) by contradiction. Suppose the period For fhe nonlinear skdeton; we further assume that 



of the limit cycle is ir. For continuous dynamics, the 
assumption of a unique solution is equivalent to the 
statement that as i — > oo, 



i+ir 



(A.9) 



E {9e(X k ~i)-x k Y^0 

k=i+l 



go(x) has bounded second-order derivative with re- 
spect to in neighbor of *d for all possible values 
ofyt- Suppose that the assumptions (C1)-(C4) hold. 
Then 



T- 1/2 (9 {m} - tf m , w ) 4 n(o, n^A(n- 



1\T> 



<t — >• 9^9q. Specifically, for model (3.1) and yt = xt + t]t, if 

if t a ca a 4- 1. u +t, + • +1, ■ q iii i E\eA s < oo and E\m\ s < oo for some 5 > 4, i/ien 
if (A.8) does not hold, that is, there is a ir such that 11 1/1 J ' 



k=l 



then there must be a sequence {ij : j = 1, 2, . . .} with 

£j — >• oo as j — >• oo and 

ij+TT 

(A.10) ^ {^ fel (A t )-Xi +fc } 2 ^0 asj^oo. 

k=ij —p 

Let = gf ] {X t ) and e 4+fc = z t+k -x t+k . It follows 
from (A. 10) that for k = ij — p, . . . ,ij + 7r, 

|e t+fe |-)-0 asj-^oo, 

and that 

E {^(^t+fc-i +e t+ fc-i,..., 

k=ij + l 



A = Cov(A f ,A t ) 

+ E{ Cov ( A ^ A t-k) + Cov(A t _ fe , A t )}, 



fc=i 



fc=i 



00 



de T 



eJ($ k -V k )Xi 



d 2 j\x t ) 

d6d6 T 
Jk], 



+ e 



1 * d#<96> T 



with A t = J2T=i w k{Ylj=o ej&e^t+k-j + 7?i +fc } • 

(y t )/ae+{ e J"($ fc -^) A t -e7$ fc M} • <9^ ] (y t )/d0. 
For £/ie nonlinear model (3.5) and y t = xt + rjt, 



xt+k-p + et+fc. 



0. 



A = Var 



dg [ £ ] {y t -k) 
> ^Wk ^ 



,fc=l 



90 
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k=l 



dgfjyt) 
89 



and 



S7 = y^t^E 



fc=i 



dgf(y t )dg™(y t ) 
90 89 T 



we have E||^ jm || 5 < oo. It follows from Theorem 2.21 
of Fan and Yao [(2003), page 75] that 

£A t /VT4tf(o,£r A (*)Y 



t=i 



k=0 



Proof. Let Q{9) = £™ =1 w k E[y t+k 
and 



(Y t )f 



On the other hand, we have by (C3) and Proposi- 
tion 2.8 of Fan and Yao [(2003), page 74] 

d 2 Q n (e*) 



Q n {9) = ^WkT-^lvt+k 

k=l t=l 
m T-k 



8989 

T m 

^2T-£5> 



t=i k=\ 



x ;j(Y t )dj}(Y t ) 

89 89 T 



k=l 



t=l 



[yt+k - gf ] {Yt)} qqqqt 



o 2 i}(Y t ) 



dcf 



Qn(9). 



2fl. 



For model (3.1), we have Xt+i = $Xt + £t+i and 



Let 0{ m } = argmin^gQ Q n (9). We denote this by 9 
and $ mw by for simplicity. It is easy to see that X t+k = <$> k X t + (£ t+k + $£t +k -i + • • • + $ £t+i)- 
Qn(9) -> Q(0). Following the same argument of Wu Lgt ^ bg thg matrix $ when Q = ^ respectively . 



(1981), we have 9 — > "d in probability. 

By the definition of 9, we have 8Q n {9)/89 = 0. By 
Taylor expansion, we have 







(A.ll) 



8Qn(9) 

89 

dQM , 8 2 Qn(9*] 



+ 



0-0), 



Note that Y t = X t +M t . It follows that 
y t+k - * k Y t 

= (x t+k +N t+k ) - <S> k (X t +N t ) + (4> fc - * k )Y k 
= (£ t+k + <S>£ t+ k-i + ■■■ + ^ k - 1 £t+i) 
+ {M t+k - <$> k M t ) + (§ k - V k )Y t 



89 ' 8989 T 
where 9* is a vector between 9 and t9, and 
8QnW 



and 



y t+k - eJV k Y t 



89 



-IT-^w^lVt+k-gfiYt)] 



(A.12) 



fc=i t=i 



89 



fc-i 

(A. 14) = J^e^^eiet+fc-j + 7? t+fe 

+ e T($fe_*fc)x t -e^ r $ fc A/'f 
It follows from (A. 13) and (A. 14) that 

dgf{Y t ) 



T 



f=l 



where &, m = ET^Mvt+k ~ gf {Y t )]dg^ {Y t ) / 89 . 
By the definition of $, we have 8Q(i})/89 = 0, that 
is, 



2^w fc Eei 

fe=i 

We have 



<90 



0. 



<9^ 



(A.13) 



EC, 



0. 



T m 



Since is a strongly mixing process with exponen- 
tial decreasing mixing coefficients, so is £,t,m - By (C2), 



'k-l 

^2 ej $ J etet+k-j + r] t+k 

dj\Y t ) 
89 
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+ I {ej($ k - t> k )X t - e~[$ k Nt} 



dgf{Y t ) 
89 
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Therefore, it follows from (A. 11) that 

( oo 

D 



E 



{e^ ($ fe - ^ k )X t - e[ <S> k N t } 



89 



def 



t=i 



and that EA t = 0. Let 4 = 8(eJ^ k ) /d0. We further 
have 



■F = 4(^+M). 



96 89 
Since (X t ,et,rjt) is a stationary process and a strong- 
ly mixing sequence (Pham and Tran, 1985) with ex- 
ponentially decreasing mixing coefficients, and At is 
a function of {(X T , e T , 7] T ) :t = t,t — 1, . . . ,t — m}, it 
is easy to see that At is also a strongly mixing se- 
quence with exponentially decreasing mixing coeffi- 
cients. Note that EAt — and EjAtl" 5 < oo for some 
5 > 2. By Theorem 2.21 of Fan and Yao [(2003), 
page 75], we have 

£Vv^4tf[o,£;rA(fc)Y 



t=l 



k=0 



On the other hand, we have in probability 

8 2 Qn($) 



8989 1 



k=l t=l 



09 



89 T 



m T (k-l 

fc=l t=l [j=0 

+ eJ^ k -^ k )X t -eJ^ k Ar t \^£^ 



2]Tu; fc E 

k=l 

m 

-2^w k B 



09 



89 ] 



k=l 



ej (<f> k - ^ k )X t 



8989 T 



k=l 



1 * <90d6 T 



r-V2(fii _ ^) 4 jv ^ o, n- 1 ^r A (fc)(n _1 ; 



fc=0 



Next, consider model (3.5). Note that rjt-\ 



Vt+k 



(X t ). We have from (A. 12) that 



8Q n ($) 



89 



-2T 1 ^w k ^r jH 



k=\ t=l 
m T 



89 



k=l t=l 



i\Y t )] 



i\Y t ) 



t=i 



}^w k - 



09 



(Yt-k) 



k=l 



09 



Vt 



+ Y.Ma [ e^x t )-i\Y t )] 



k=l 



dgf{Y t ) 

89 



Let 



C m (x t -k,Vt-k) 



E 

_fc=i 

m 



% f'(^_ fc ) 

96 



??(F) 



fc=l 



dcf 



20. 



96 ' 

By (A. 13), wehaveE5 m (At,7/ t ) = 0. Thus B m (x t , rj t ) 
are independent with expectation 0. It is easy to see 
that Cm,t = C m (X t -k,i1t-k)vt + B m (X t ,rft) is a mar- 
tingale difference. The Lyapunov's condition is sat- 
isfied. Thus, we have 

T 

(A.15) T' 1 ' 2 ]T e f , m 4 iV{0, E(e m ,td,t)}- 
t=l 

Similarly to dQ n ("d)/d9 above, we have 

8 2 Qn(9*) 
8989 
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(A.16) 



t=i 



k=l 



09d9 T 



Vt 



T m 

t=i fc=i 



00 



(A.17) 



def 



fc=l 



20. 



50 d# T J 



Finally, from (A.ll), (A. 15) and (A.16) we have 

T-v\e - ■&) 4 iv{o, cr^&^jjr 1 }. 

We have completed the proof. □ 
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